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Abstract 

Microbial identification is a central issue in microbiology, in particular in the fields of infectious 
diseases diagnosis and industrial quality control. The concept of species is tightly linked to the 
concept of biological and clinical classification where the proximity between species is generally 
measured in terms of evolutionary distances and/or clinical phenotypes. Surprisingly, the informa¬ 
tion provided by this well-known hierarchical structure is rarely used by machine learning-based 
automatic microbial identification systems. Structured machine learning methods were recently pro¬ 
posed for taking into account the structure embedded in a hierarchy and using it as additional a priori 
information, and could therefore allow to improve microbial identification systems. 

We test and compare several state-of-the-art machine learning methods for microbial identi¬ 
fication on a new Matrix-Assisted Laser Desorption/Ionization Time-of-Flight mass spectrometry 
(MALDI-TOF MS) dataset. We include in the benchmark standard and structured methods, that 
leverage the knowledge of the underlying hierarchical structure in the learning process. Our results 
show that although some methods perform better than others, structured methods do not consistently 
perform better than their "flat" counterparts. We postulate that this is partly due to the fact that stan¬ 
dard methods already reach a high level of accuracy in this context, and that they mainly confuse 
species close to each other in the tree, a case where using the known hierarchy is not helpful. 


1 Introduction 


Microbial identification is the task of determining to which species a microorganism isolated from a 
clinical or industrial sample belongs. It plays a central role in the diagnosis of infectious diseases 
and industrial quality control. In the clinical setting, identification is often the first step towards a 
finer characterization of the microorganism, aiming in general to establish its virulence and/or antibiotic 
resistance profiles, which is ultimately used by the clinician to prescribe a therapy. 

Since the proof of concept of bacterial identification with MALDI-TOF MS ( Anhalt et al.\ 1975| , , 
this high-throughput technology has been improved up to a genuine paradigm breaking technology in 
microbiology, allowing to quickly, cheaply and efficiently characterize a microorganism ( [Bizzini et al.\ 
2010 Cherkaoui et al.\ 2010| Gaillot et al.\ 201 1| Tan et al\ 20121. Starting from an isolated colony 


of the targeted microorganism, MALDI-TOF MS provides a snapshot of its proteomic content. Such a 
proteomic fingerprint is highly species specific, and can be used fo identify a microorganism by matching 
it with a reference database of annotated fingerprints ( van Belkum et al\ 2012| ). 

At the basis of MALDI-TOF MS identification system is therefore a software component in charge 
of finding the closest match between the fingerprint of the unknown microorganisms and the reference 
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fingerprints of the database. From the data analysis perspective, this can be formalized as a multiclass 
classification task. This learning task presents several challenging issues. First, MALDI-TOF mass 
spectra are measured on several tens of thousands of mass to charge channels, and although they are 
generally pre-processed in order to extract their predominant peaks ( [Coombes et a/. [[2007 1, the resulting 
peak lists are still high-dimensional vectors. Moreover, current commercial systems like the Biotyper 
(Bruker Daltonics, Germany), LT2 (Andromas, France), or VITEK-MS (bioMerieux, France) address 


several hundreds of species ( [Martiny et al.\ |2012[ ), which constitutes a relatively massive multiclass 
problem. Finally, the number of observations per class, that is, of representative strains per species, 
is often limited, which leads to strongly unbalanced datasets. On the other hand, the classes of the 
problem correspond to microbial species which can be organized into well known hierarchical structures, 
generally defined in terms of evolutionary distances and/or phenotypic differences. Such tree structures 
provide a rich source of information that could be added as prior knowledge within the training of 
automatic microbial identification systems. Several "structured" machine learning methods were indeed 
recently proposed for taking into account the structure embedded in a hierarchy and using it as additional 
a priori information (Hofman et al.\ 2003 Tsochantaridis et al. 2005; [Sun et al.\ 200 Dumais et al. 


20001, and could potentially be used to train microbial identification systems. Surprisingly, however, this 


possibility has not been investigated to our knowledge, and current systems implement "flat" multiclass 
classification algorithms that do not take into account the known tree structure. In this paper, we evaluate 
the relevance of structured machine-learning methods in the context of microbial identification from 


MALDI-TOF mass spectra. For that purpose, we use the MicroMass dataset (Mahe et al. 20141 to 
benchmark several "flat" and "structured" machine learning techniques. 


2 Material and Methods 


2.1 Benchmark dataset 

The dataset considered in this benchmark is described in Table [T] It involves 20 Gram positive and 
negative bacterial species covering nine genera. This dataset was extracted from the reference database 
embedded in the commercial VITEK-MS system and made public through the UCI machine learning 
repositor)!^. Each species is represented by 11 to 60 mass spectra obtained from 7 to 20 bacterial strains, 
leading altogether to a dataset of 213 strains and 571 spectra. These spectra were obtained according to 
the standard workflow used in clinical routine in which the microorganism was first grown on an agar 
plate from 24 to 48 hours, before some colonies were picked, spotted on a MALDI slide and a mass 
spectrum was acquired. 

The 20 bacterial species involved in this study and the underlying hierarchical tree are shown in 
Eigures[T]and[^ 

We note that the tree considered in this study involves both phenotypic and evolutionary traits, its 
uppermost level separating species into Gram positive and Gram negative, and its two lowest levels 
corresponding to the species and genus taxonomic ranks. Such a hybrid hierarchical definition is com¬ 
mon in the context of clinical microbiology, where manual identification involves a succession of tests 
to establish several phenotypic and metabolic properties of the microorganism to identify (e.g.. Gram 
+/- or aerobe/anaerobe). These properties correspond to the upper levels of the tree, while the lower 
ones correspond to standard phylogenetic ranks (e.g., family, genus and species). We also note that this 
dataset contains several pairs of groups of species known to be hard to discriminate in general. This 
is for instance the case of the Bacillus cereus and Bacillus thuringiensis species, which are known to 
belong to the Bacillus cereus group ([Helgason et al.\ 20001, as well as the Escherichia coli species 


and the species of the Shigella genus, which are often considered to belong to the same species (Lan 
et al.\ 2002| ). Accordingly, Escherichia coli and the three Shigella species involved in this dataset were 
gathered in a common genus. 


’http://archive.ics.uci.edu/ml/datasets/MicroMass 
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Figure 1 : MicroMass hierarchical tree structure (Gram + bacteria). This tree shows the hierarehieal 
organization of the baeterial panel eonsidered in this benehmark, that belong to the Gram + baeteria. The 
leaves of the tree eorrespond to the 8 speeies and their parent to the 4 genera. Internal nodes eorrespond 
to either phenotypie (e.g. aerobie and anaerobie at the top of the tree) or taxonomie attributes. 
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Table 1: MicroMass dataset. This table deseribes the MieroMass dataset eontent, in terms of used 
bacterial genera and species. It also provides information on the number of bacterial strains and mass- 
spectra for each species. 


Species name 

Species ID 

Number of strains 

Number of spectra 

Bacillus cereus 

BAC.CEU 

10 

26 

Bacillus thuringiensis 

BAG.THU 

8 

11 

Citrobacter braakii 

CIT.BRA 

9 

26 

Citrobacter freundii 

CIT.FRE 

10 

28 

Clostridium difficile 

CEO.DIE 

7 

14 

Clostridium glycolicum 

CEO.GEY 

9 

16 

Enterobacter asburiae 

ENT.ASB 

10 

29 

Enterobacter cloacae 

ENT.CEC 

16 

52 

Escherichia coli 

ESH.COE 

20 

60 

Haemophilus influenzae 

HAE.INE 

18 

50 

Haemophilus parainfluenzae 

HAE.PAR 

9 

21 

Listeria ivanovii 

EIS.ISI 

9 

29 

Listeria monocytogenes 

EIS.MNC 

10 

31 

Shigella boydii 

SHG.BOY 

9 

18 

Shigella flexneri 

SHG.EEX 

10 

32 

Shigella sonnei 

SHG.SON 

10 

31 

Streptococcus mitis 

STR.MIT 

10 

26 

Streptococcus oralis 

STR.ORA 

9 

24 

Yersinia enterocolitica 

YER.ETC 

10 

27 

Yersinia frederiksenii 

YER.ERD 

10 

20 



Figure 2: MicroMass hierarchical tree structure (Gram - bacteria). This tree shows the hierarchical 
organization of the bacterial panel considered in this benchmark, that belong to the Gram - bacteria. The 
leaves of the tree correspond to the 12 species and their parent to the 5 genera. Internal nodes correspond 
to either phenotypic (e.g. aerobic and anaerobic at the top of the tree) or taxonomic attributes. 
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We note finally that we have considered in this study a peak-list representation in which a mass 
spectrum is represented by a vector x G where p is the numbers of bins considered to discretize 
the mass to charge range, and each entry of x is derived from the intensity of the peak(s) found in the 
corresponding bin. While several schemes have been proposed to define such a peak-list representa¬ 
tion ( Coombes et al.\ 20071, we have relied here on the approach embedded in the VITEK-MS system, 
which provides a peak-list representation of dimension p = 1300, with typically between 50 and 150 
peaks per spectrum. Further details about this dataset are available in (|Mahe gf al]|2014[). 


2.2 Classification methods 

In this section we provide a brief description of the various classification strategies considered in this 
study. We refer the interested reader to the original publications for a more detailed presentation. We first 
describe the standard support vector machine (SVM) algorithm and its multiclass extensions. We then 
describe three approaches to leverage information about a hierarchical structure reflecting the proximity 
of bacterial species: a cost-sensitive SVM formulation, a hierarchical SVM formulation, and a divide- 
and-conquer or "cascade" strategy. Finally, we present other standard machine learning algorithms not 
based on SVMs, such as nearest-neighbours and random forests, that were included in the benchmark. 


Multiclass SVMs In its original form, the SVM algorithm is a binary classification algorithm ( [Cortes 
I. It aims to build a classification rule to classify instances from the space Af = into 
commonly referred to as positive or negative. In its simplest linear form, the SVM al¬ 
gorithm builds a hyperplane separating the vector space X in two half-spaces. For that purpose, the 
SVM algorithm relies on a training dataset of N instances xi,... ,X]y G X with their associated labels 
yi,... ,yN G { — 1,1}, and seeks to correctly classify the training data while maximizing the margin 
of the hyperplane, which is defined as the smallest distance between the hyperplane and the training in¬ 
stances. These two criteria are hard to fulfill simultaneously, and in practice the SVM algorithm achieves 
a trade-off between these two objectives. When instances must be classified into K > 2 classes, an 
extension of this binary SVM is needed. The most standard way to address multiclass classification 
problems with SVMs is to train and combine several binary classifiers into a mutliclass classification 
rule. Two popular schemes allow to do so: 


et al. 1995 


two classes. 


• the one versus all scheme (SVM-OVA), where a set of K binary SVMs is trained to separate each 
of the K classes from the K — 1 other ones, leading to a set of K predictors. To predict the class of 
a new instance x, each of the K predictors is applied to x, and the one predicting with the largest 
margin is the winner. 

• the one versus one scheme (SVM-OVO), where 7f(iT—l)/2 binary SVMs are trained to distinguish 
between every pair of classes. The class predicted for an instance x is the one obtaining the highest 
number of votes (a number between 0 and K — 1) according to these classifiers. 


In a similar spirit to the one-versus-all scheme, an alternative strategy proposed by (jTsochantaridis et oT. 


20051 and referred to as Multi class below, consists in learning a set of class specific classifiers. 


but to combine them in a single model and to learn them simultaneously. This formulation requires 
to solve a single optimization problem, but with a greater number of constraints than each individual 
SVM involved in the one-versus-all formulation. In practice, efficient algorithms enable to obtain an 
approximate solution of this problem (Tsochantaridis et al. 20051. This formulation paves the way 
to the development of cost-sensitive and so-called structured classifiers, that we introduce in the two 
following sections. 


Cost-sensitive multiclass SVMs For practical applications, different errors can have different impact: 
it can be less severe to mistake class A for class B than class A for class Z for instance. This is notably 
the case for microbial identification which can orient therapy before antibiotic susceptibility results 
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are available. Cost-sensitive classifiers distinguish between the various types of classification errors 
and penalize them differently in the learning process. The above multiclass formulation can be easily 
modified fo accommodafe such a cosf-sensifive mechanism ( [Tsochanfaridis et ^|2005| ). Indeed, assume 
fhat a loss funclion A : 3^ x —)> M is available]^ such fhaf A{y,y') quantifies fhe loss, or severify, of 
predicting class y' if fhe frue class is y, A{y,y') > 0 for y 7 ^ y' and A(y,y) = 0. Such a loss 
funclion can be leveraged in fhe fraining process fhrough a redefinilion of fhe consfrainfs involved in 
the underlying optimization problem. This redefinition has the effect of adjusting the strength of the 
constraints according to the loss function. Note that the standard formulation corresponds to using a 
binary loss function: A(y, y') = 1 for y 7 ^ y' and A(y,y) = 0. For practical applications, this cost- 
sensitive formulation allows to leverage the training process prior information about the relationship 
between the classes and/or requirements about the classification performances expected. In this study 
we call this approach TreeLoss and use A(y, y') as the length of the shortest path connecting the two 
species in the considered tree. 


Hierarchy structured S VMs The structured S VM formulation of (Tsochantaridis et al.\ 2005 1 enables 


to make a further use of the hierarchical structure underlying the microbial identification multiclass 
problem. Indeed, it does not only allow to leverage a loss function in the learning process to penalize 
misclassifications involving hierarchically distant species, but it also introduces new variables that can 
be further exploited by the algorithm. Loosely speaking, the original variables are repeated as "blocks" 
for each node of the tree. These variables are then turned on or off, for each observation, depending 
on its position on the hierarchy : every block of variables associated to a node that belongs to the path 
connecting the root node to the leaf node corresponding to the observation label is turned on, and is 
turned off otherwise. As a result, the closer two observations are in the tree, the more variables they 
share, which can be used by the algorithms to learn coarse to fine association rules (Hofman et al.\ 2003} 


Tsochanfaridis et al. 20051. This approach, which we refer fo as Structured below, and fhe cosf- 


sensifive mulficlass formulafion introduced above have in common to leverage a loss function to take 
into account the severity of the classification errors. This property can be expected to increase the quality 
of the predictions made by these algorithms, which will be trained to avoid "severe", that is, high loss, 
classification errors. As in the previous paragraph, we define A(y, y') as fhe lengfh of fhe shorfesf pafh 
connecfing nodes y and y' in fhe free, hence direcfly define fhe notion of severify as fhe free disfance. 


Cascade approach The lasf SVM-based sfrafegy we consider is a divide and conquer approach where 
a SVM classifier is learned al each infernal node of fhe free fo assign a speclrum fo one of ils children. A 


lop-down approach is Ihen used fo classify a speclrum fo a leaf node by fhis cascade of classifiers (Sun 


et al. 200 ![ Dumais et al.\ 2000 1 . Allhough any lype of classifier can be considered al each node. 


we choose fo rely on SVM in fhis sludy and call fhe resulfing melhod Cascade-of-Classifiers (CoC). 
Finally, following ( Benabdeslem et aL\ 2006| l, we consider a varianf of fhis approach in which fhe free 
used fo define fhe cascade is oblained in a preliminary sfep of unsupervised cluslering carried ouf from 
species-specific profolypes. We refer fo fhis approach as Dendrogram-SVMs (DSVM). 


Other classification methods Finally, we consider three methods not based on SVMs in this bench¬ 
mark. Random forest (RF) and similarity-based approaches have indeed already been successfully used 
in the context of MS data classification ( De Bruyne et al.\ 2011} |Ta§km et al.\ 2013| l. We therefore 
include in the benchmark the RF method described in ( Breiman]|2001| ), referred to as RF, which con¬ 
sists in learning many decision trees and predicting with a majority vote strategy. We also evaluate two 
similarity-based approaches: a 1-nearest-neighbour (1-NN) and a 1-nearest-centroid (1-Centroid) 
approach. In the 1-NN method, a new spectrum is classified in the same class as its closest spectrum in 
the training set. The same classification rule is applied in the nearest-centroid approach, the centroid of 
a given species being defined as its median spectrum. 


^Note that in practice this loss function can he summarized as a A x K matrix. 
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Experimental setting 


We evaluate the classification performance of the various methods by cross-validation. To define fhe 
cross-validafion folds we lake info accounf fhe slrain informafion. Indeed, fhe dalasel consisls of 571 
speclra obfained from 213 slrains, wilh in average less lhan 3 and up fo 6 speclra per slrain. The vari- 
abilily observed wilhin fhe replicafe speclra of a given slrain is purely technical, and is Iherefore lower 
than the level of variability that is expected in clinical routine, where an additional level of biological 
variability is expected due to the fact that the microorganisms to identify differ from that used to learn 
the classification model. To mimic this setting, hence to avoid optimistic evaluation of classification 
performance, we therefore affect spectra of a given strain to the same cross-validation fold. In this study, 
we actually resort to a leave one strain out cross-validation strategy in which a single strain is kept aside 
at each step, thus leading to a 213-fold cross-validation set up. 

To assess the classification performance, we primarily consider an accuracy criterion. However, 
since each species of the dataset is represented by a varying number of strains and each strain by a vary¬ 
ing number of spectra, we adopt a nested definition of accuracy criterion, instead of classical proportion 
of correct classifications. We first define a strain-level accuracy as fhe proporfion of specfra fhaf are 
correcfly classified for each sfrain, and a species-level accuracy as fhe average sfrain-level accuracy for 
each species. The overall accuracy indicafor is fhen defined as fhe average species-level accuracy. In 
order fo compare fhe benchmarked approaches, we rely on fhe fwo-sample Kolgomorov-Smirnov fesf 
( Kolmogorov! [1933} Smirnov 19391 applied on vectors of 20 accuracies af fhe species level. The sec¬ 
ond performance indicator we consider is fhe disfribufion of fhe free loss of misclassificalions, which 
Iherefore quanlifies fheir severify. As can be read from Figure [T] fhis loss can vary in fhis sfudy from 
2 to 12, when a species is respecfively misfaken for a species of fhe same genus or of fhe ofher Gram. 
Because fhese fypes of errors are easier fo inferpref fhan summary sfafisfics of fhe free loss disfribufion, 
we reporf fhe proporfion of errors fhaf fall in fhe following cafegories: "wifhin-genus error" (A = 2), 
"oufside genus buf same Gram error" (2 < A < 12), "disfincf-Gram error" (A = 12). 

The regularizafion (C) paramefer of fhe various SVM formulations considered in fhis sfudy was 
opfimized wifhin each fold of fhe leave one slrain ouf cross-validafion process by an inner 10 fold cross- 
validafion. As before, specfra of fhe same sfrain are syslemalically affecled fo fhe same fold. The grid of 
candidate values was sef to {10“®, 10“^,..., 10^, 10®} and fhe value was chosen to maximize fhe nested 
accuracy indicator defined above. The sfandard and cascade SVM approaches (SVM-OVA, SVM-OVO, 
CoC and DSVM) were implemented using fhe Rpackage Liblineal^ For fhe Iwo cascade approaches 
(CoC and DSVM), one-versus-all classifiers were frained al each infernal node of fhe hierarchy. The 
free involved in fhe DSVM mefhod was generated by fhe he lust function of fhe R package stats, 
wilh a complete linkage clustering mefhod. The Multiclass SVM implemenlalion relies on fhe C 
library SVM-light (Joachims et a/.[|1999 1. The cosl-sensilive (TreeLoss) and Structured SVM 


formulations were implemented based on fhe C library SVM-struct ( [Joachims et af.||2009| l. We have 
relied on fhe slack-rescaling approach fo inlegrale fhe loss funclion A in fhe learning process. We have 
moreover considered a precision of e = 0.1 on fhe solution and used fhe 1-slack algorilhm operating in 
fhe dual (option w=3). 

The hyperparameters of fhe alfemalive slralegies were sef from preliminary experimenls. We relied 
on fhe R package randomForest to build RF models. The number of frees (nlree) and variables per 
free (miry) of fhe random fores! were respectively sef fo fhe defaulf value of 500 and fo 36, according 
to fhe sfandard heuristics miry = Preliminary experimenls revealed fhaf fhese parameters had little 
influence on fhe resulls as long as Ihey were sufficienlly high, especially nlree. Regarding similarily- 
based melhods, fhe number of neighbours fo consider in fhe nearesl neighbours was sef to 1 and fhe 
Euclidean disfance was used. The choice of fhe disfance criferion had little influence on fhe resulls buf 
performance decreased when fhe number of neighbours increased. The Euclidean disfance was used for 
fhe nearesl cenlroid approach as well. 

We nole finally fhaf fhe fealure veclors were syslemalically scaled fo unif Euclidean norm. 


- http://cran.r-project.org/web/packages/LiblineaR/ 
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method 

accuracy 

# correct 

# within-genus 

# within-Gram 

# distinct-Gram 

1-NN 

76.8 

442 

119 

6 

4 

1-Centroid 

78.8 

445 

104 

7 

15 

RF 

84.0 

494 

63 

12 

2 

SVM-OVO 

86.6 

506 

52 

13 

0 

SVM-OVA 

88.9 

514 

50 

4 

3 

Multiclass 

88.9 

516 

47 

4 

4 

Treeloss 

89.3 

517 

47 

3 

4 

Structured 

89.4 

517 

47 

4 

3 

CoC 

88.6 

505 

55 

11 

0 

DSVM 

87.1 

507 

56 

2 
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Table 2: Cross-validation results on MicroMass dataset. This table summarizes the cross-validation 
results obtained for each benchmarked method. The accuracy measure corresponds to the nested accu¬ 
racy definition. The four following figures explicitly give the numbers of correct prediction, of within- 
genus errors (for which a species was mistaken for a species of the same genus), of within-Gram errors 
(for which a species was mistaken for a species of another genus of the same Gram) and of distinct- 
Gram errors (for which a species was mistaken for another species of the other Gram). Method names 


are specified in the main text of section 2.2 


Results and Discussion 


The results of the benchmark experiment described in the previous sections are summarized in Table 
Considering the overall accuracy obtained by the various methods, we first note that SVM classi¬ 
fiers, with an accuracy ranging from 86.6 to 89.4%, outperform random forests (accuracy of 84%) and 
similarity-based approaches (accuracy of 76.8% and 78.8% for the nearest neighbour and nearest cen¬ 
troid approaches respectively). In both cases, these differences are significant (P-value < 0.05). Among 
the different SVM formulations, we see that the best structured SVM (Structured), with an accuracy 
of 89.4%, outperforms the best "flat" SVMs (SVM-OVA and Mult id ass), which reach an accuracy of 
88.9%. This difference, however, is not significant (P-value > 0.05), suggesting that the more elaborate 
structured SVMs are not particularly useful for this application. 

This being said, a closer look at the nature of the misclassifications, given in Table reveals 
some slight differences between the various SVM strategies. We note indeed that while SVM-OVA, 
Multiclass, TreeLoss and Structured make fewer errors than SVM-OVO and CoC (54 to 57 
versus 65 to 66), some of these errors involve mistaking a species for a species of the other Gram, which 
never occur with SVM-OVO and CoC. This however comes at the price of an increased proportion of er¬ 
rors involving mistaking a species for another one of the same Gram but of another genus, and therefore 
suggests that a trade-off between the number and the severity of the classihcation errors can be achieved. 
In a similar spirit, we observe some discrepancies between the results provided by the two cascade ap¬ 
proaches (CoC and DSVM): while the two approaches lead to a similar number of classihcation errors, 
DSVM leads to a higher rate of uncorrect Gram errors for a lower rate of distinct genus but same Gram 
errors. These two methods only differ in the tree considered, which therefore suggests that its structure 
has indeed an important role in the learning process and that it could be optimized ( |Song et 20071. 

Finally, a striking observation that can be made from Table is that the great majority of errors in¬ 
volve predicting a species for a species of the same genus, for any considered method. While this makes 
sense from a biological point of view, this raises at least two hypotheses to explain why the structured 
methods considered in this benchmark, and in particular those derived from the structured SVM for¬ 
malism (TreeLoss and Structured), did not bring any improvement over their "flat" counterparts. 
First, we note that with a loss function A(y, y') defined as the length of the shortest path between species 
y and y' in the tree, this type of error is the less penalized one. While this is indeed a natural and relevant 
definition, it can mainly be expected to limit the number of errors involving remote pairs of species, and 












Common misclassifications in the benchmarked methods 

102 



Figure 3: MicroMass dataset; Common classification errors. Each bar represents one of the most 
frequent confusions observed across all evaluated classification approaches. 


hardly to improve over a "flat" strategy that does a limited number of errors of this kind, as this is the 
case for SVM-OVA for instance. On the other hand, it may also be the case that the tree considered in 
this study is not informative below the genus level. As mentioned previously, the dataset considered in 
this study involves several pairs or groups of species that are known to be hard to discriminate in general, 
and by MALDI-TOF MS in particular. 

Figure shows the counts of the most common types of misclassifications obtained across all the 
methods considered. It reveals that five pairs or groups of species proved to be particularly challenging: 
Bacillus cereus / B. thuringiensis. Streptococcus mitis / S. oralis, Enterobacter asburiae / E. cloacae, 
Citrobacter braakii / C. freundii, and the group defined by E. coli and the three Shigella species. It 
also shows that E. coli and Enterobacter cloacae, that do not belong to the same genus but both to the 
Enterobacteriaceae family, are relatively often mistaken. The biological proximity within some of these 
pairs or groups of species may in fact be beyond what can be captured by the MAFDI-TOF technology. 
The B. cereus and B. thuringiensis species are for instance known to belong to the Bacillus cereus group, 
which is sometimes considered to define a single species ( Helgason etal.\ 20001 and other studies indeed 
suggest that they cannot be discriminated by MAFDI-TOF ( Fasch et al.\ 20091 . Streptococcus mitis and 
Streptococcus oralis are also part of similar group comprising more than 99% 16S rRNA similarity 
(Kawamura et al. 19951, and MAFDI-TOF mass-spectrometry is known to be hardly able to distinguish 
them properly ( [Williamson et al.\ 20081. 

Figurej^illustrates the fact that mass spectra obtained in this study from B. cereus and B. thuringien¬ 
sis are hardly distinguishable, at least when they have undergone the process of peak extraction, as 
opposed to the spectra obtained from Clostridium difficile and C. glycolicum, that are almost never mis¬ 
taken one for the other. 
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Figure 4: MicroMass dataset: Mass-spectra clustering at the genus level. Left: Bacillus genus. 
Right: Clostridium genus. Mass-spectra (rows) belonging to a given genus are clustered according to 
their peak lists (columns). For clarity purpose, we removed features equal to zero among all the genus 
mass-spectra. 


Conclusion 


We evaluated several structured methods in the microbial identification context, using mass-spectrometry 
data. Our results suggest that methods exploiting the underlying bacterial hierarchical structure perform 
as well as standard "flat" methods. We noted in particular that the majority of classification errors 
obtained by all the methods considered in this benchmark are within-genus misidentifications. We pos¬ 
tulate that the structured methods considered in this benchmark are not tailored to improve flat methods 
for this type of errors. Unfortunately, a larger panel of strains with a careful definition of the reference 


identification would be required to validate this hypothesis. (Zhou et al. 20111 recently proposed a 
structured regularization method specifically designed to cope with this issue, and it would therefore be 
interesting to evaluate its relevance in this context. 
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